EXPLORE & SUMMARIZE DATA | White Wine Quality Analysis
Aurore Dupont
========================================================

Introduction

In this project, we explore a dataset on the quality of white wines, using exploratory data analysis techniques in R to examine relationships ranging from a single variable to multiple variables.

This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Descriptions of the variables

Input variables (based on physicochemical tests):
- Fixed acidity (tartaric acid - g / dm^3): Most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
- Volatile acidity (acetic acid - g / dm^3): The amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
- Citric acid (g / dm^3): Found in small quantities, citric acid can add ‘freshness’ and flavor to wines
- Residual sugar (g / dm^3): The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
- Chlorides (sodium chloride - g / dm^3): The amount of salt in the wine
- Free sulfur dioxide (mg / dm^3): The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- Total sulfur dioxide (mg / dm^3): Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- Density (g / cm^3): The density of wine is close to that of water depending on the percent alcohol and sugar content
- pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
- Sulphates (potassium sulphate - g / dm^3): A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- Alcohol (% by volume): The percent alcohol content of the wine

Output variable (based on sensory data):
- Quality: Score between 0 and 10

Several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.


Univariate Plots Section

We are starting with some preliminary exploration of the dataset.
Summaries of the data and univariate plots will allow us to understand the structure of the individual variables in the dataset.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##                      vars    n    mean      sd  median trimmed     mad
## X                       1 4898 2449.50 1414.08 2449.50 2449.50 1815.44
## fixed.acidity           2 4898    6.85    0.84    6.80    6.82    0.74
## volatile.acidity        3 4898    0.28    0.10    0.26    0.27    0.09
## citric.acid             4 4898    0.33    0.12    0.32    0.33    0.09
## residual.sugar          5 4898    6.39    5.07    5.20    5.80    5.34
## chlorides               6 4898    0.05    0.02    0.04    0.04    0.01
## free.sulfur.dioxide     7 4898   35.31   17.01   34.00   34.36   16.31
## total.sulfur.dioxide    8 4898  138.36   42.50  134.00  136.96   43.00
## density                 9 4898    0.99    0.00    0.99    0.99    0.00
## pH                     10 4898    3.19    0.15    3.18    3.18    0.15
## sulphates              11 4898    0.49    0.11    0.47    0.48    0.10
## alcohol                12 4898   10.51    1.23   10.40   10.43    1.48
## quality                13 4898    5.88    0.89    6.00    5.85    1.48
##                       min     max   range skew kurtosis    se
## X                    1.00 4898.00 4897.00 0.00    -1.20 20.21
## fixed.acidity        3.80   14.20   10.40 0.65     2.17  0.01
## volatile.acidity     0.08    1.10    1.02 1.58     5.08  0.00
## citric.acid          0.00    1.66    1.66 1.28     6.16  0.00
## residual.sugar       0.60   65.80   65.20 1.08     3.46  0.07
## chlorides            0.01    0.35    0.34 5.02    37.51  0.00
## free.sulfur.dioxide  2.00  289.00  287.00 1.41    11.45  0.24
## total.sulfur.dioxide 9.00  440.00  431.00 0.39     0.57  0.61
## density              0.99    1.04    0.05 0.98     9.78  0.00
## pH                   2.72    3.82    1.10 0.46     0.53  0.00
## sulphates            0.22    1.08    0.86 0.98     1.59  0.00
## alcohol              8.00   14.20    6.20 0.49    -0.70  0.02
## quality              3.00    9.00    6.00 0.16     0.21  0.01
##   wine_id fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       1           7.0             0.27        0.36           20.7
## 2       2           6.3             0.30        0.34            1.6
## 3       3           8.1             0.28        0.40            6.9
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1     0.045                  45                  170  1.0010 3.00
## 2     0.049                  14                  132  0.9940 3.30
## 3     0.050                  30                   97  0.9951 3.26
##   sulphates alcohol quality
## 1      0.45     8.8       6
## 2      0.49     9.5       6
## 3      0.44    10.1       6
##    wine_id fixed.acidity volatile.acidity citric.acid residual.sugar
## 1        1           7.0             0.27        0.36          20.70
## 2        2           6.3             0.30        0.34           1.60
## 3        3           8.1             0.28        0.40           6.90
## 4        4           7.2             0.23        0.32           8.50
## 5        5           7.2             0.23        0.32           8.50
## 6        6           8.1             0.28        0.40           6.90
## 7        7           6.2             0.32        0.16           7.00
## 8        8           7.0             0.27        0.36          20.70
## 9        9           6.3             0.30        0.34           1.60
## 10      10           8.1             0.22        0.43           1.50
## 11      11           8.1             0.27        0.41           1.45
## 12      12           8.6             0.23        0.40           4.20
## 13      13           7.9             0.18        0.37           1.20
## 14      14           6.6             0.16        0.40           1.50
## 15      15           8.3             0.42        0.62          19.25
## 16      16           6.6             0.17        0.38           1.50
## 17      17           6.3             0.48        0.04           1.10
## 18      18           6.2             0.66        0.48           1.20
## 19      19           7.4             0.34        0.42           1.10
## 20      20           6.5             0.31        0.14           7.50
## 21      21           6.2             0.66        0.48           1.20
## 22      22           6.4             0.31        0.38           2.90
## 23      23           6.8             0.26        0.42           1.70
## 24      24           7.6             0.67        0.14           1.50
## 25      25           6.6             0.27        0.41           1.30
##    chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1      0.045                  45                  170  1.0010 3.00
## 2      0.049                  14                  132  0.9940 3.30
## 3      0.050                  30                   97  0.9951 3.26
## 4      0.058                  47                  186  0.9956 3.19
## 5      0.058                  47                  186  0.9956 3.19
## 6      0.050                  30                   97  0.9951 3.26
## 7      0.045                  30                  136  0.9949 3.18
## 8      0.045                  45                  170  1.0010 3.00
## 9      0.049                  14                  132  0.9940 3.30
## 10     0.044                  28                  129  0.9938 3.22
## 11     0.033                  11                   63  0.9908 2.99
## 12     0.035                  17                  109  0.9947 3.14
## 13     0.040                  16                   75  0.9920 3.18
## 14     0.044                  48                  143  0.9912 3.54
## 15     0.040                  41                  172  1.0002 2.98
## 16     0.032                  28                  112  0.9914 3.25
## 17     0.046                  30                   99  0.9928 3.24
## 18     0.029                  29                   75  0.9892 3.33
## 19     0.033                  17                  171  0.9917 3.12
## 20     0.044                  34                  133  0.9955 3.22
## 21     0.029                  29                   75  0.9892 3.33
## 22     0.038                  19                  102  0.9912 3.17
## 23     0.049                  41                  122  0.9930 3.47
## 24     0.074                  25                  168  0.9937 3.05
## 25     0.052                  16                  142  0.9951 3.42
##    sulphates alcohol quality  level
## 1       0.45     8.8       6 Medium
## 2       0.49     9.5       6 Medium
## 3       0.44    10.1       6 Medium
## 4       0.40     9.9       6 Medium
## 5       0.40     9.9       6 Medium
## 6       0.44    10.1       6 Medium
## 7       0.47     9.6       6 Medium
## 8       0.45     8.8       6 Medium
## 9       0.49     9.5       6 Medium
## 10      0.45    11.0       6 Medium
## 11      0.56    12.0       5 Medium
## 12      0.53     9.7       5 Medium
## 13      0.63    10.8       5 Medium
## 14      0.52    12.4       7 Medium
## 15      0.67     9.7       5 Medium
## 16      0.55    11.4       7 Medium
## 17      0.36     9.6       6 Medium
## 18      0.39    12.8       8   High
## 19      0.53    11.3       6 Medium
## 20      0.50     9.5       5 Medium
## 21      0.39    12.8       8   High
## 22      0.35    11.0       7 Medium
## 23      0.48    10.5       8   High
## 24      0.51     9.3       5 Medium
## 25      0.47    10.0       6 Medium
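
The output above shows the index column renamed from X to wine_id and a new level column derived from quality. A minimal sketch of how this could be done in base R follows. It rebuilds only the index and quality columns from the 25 rows printed above. The cut points (3 and 7) are an assumption inferred from the output (quality 5 to 7 appears as Medium, 8 as High), not the author's confirmed code.

```r
# First 25 rows as printed above (only the index and quality columns).
wine <- data.frame(
  X = 1:25,
  quality = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5,
              7, 5, 7, 6, 8, 6, 5, 8, 7, 8, 5, 6)
)

# Rename the row-index column X to wine_id.
names(wine)[names(wine) == "X"] <- "wine_id"

# Bin quality into three ordered levels.
# ASSUMPTION: the breaks below are inferred from the printed output.
# Intervals are [0, 3], (3, 7] and (7, 10].
wine$level <- cut(wine$quality,
                  breaks = c(0, 3, 7, 10),
                  labels = c("Low", "Medium", "High"),
                  include.lowest = TRUE)

# Count the wines in each level.
print(table(wine$level))
```

Using cut() keeps level as a factor whose levels are ordered Low, Medium, High, which matches the order in which summary() lists them.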

Now, let’s plot some univariate data:

##     wine_id     fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality         level     
##  Min.   : 8.00   Min.   :3.000   Low   :  20  
##  1st Qu.: 9.50   1st Qu.:5.000   Medium:4698  
##  Median :10.40   Median :6.000   High  : 180  
##  Mean   :10.51   Mean   :5.878                
##  3rd Qu.:11.40   3rd Qu.:6.000                
##  Max.   :14.20   Max.   :9.000

Univariate Analysis

What is the structure of your dataset?

  • The dataset consists of 4,898 observations of 14 variables (originally there were 13 variables but I created one more).

What is/are the main feature(s) of interest in your dataset?

  • I am interested in the quality of wine, i.e. which chemical characteristics are most important in predicting quality.
  • Each expert graded the wine quality between 0 (very bad) and 10 (excellent). In this dataset the ratings range from 3 to 9, and only a few wines are rated very bad or excellent.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

  • Acidity (pH), alcohol, density and residual sugar are characteristics that might help deepen the analysis.
  • For example, we can ask whether wines with higher alcoholic content receive better ratings, whether sweeter wines contain a higher volume of alcohol or receive better ratings, but also what level of acidity (pH) is associated with the highest quality.

Did you create any new variables from existing variables in the dataset?

  • A new variable called “level” has been created to make the quality of wine easier to read. We can now easily identify whether a wine is of low, medium or high quality.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

  • I created histograms for each variable on a log10 scale (except for quality) to help identify the distributions better.
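
As an illustration of the log10 transformation mentioned above, here is a minimal base R sketch for a single variable. It uses the residual sugar values of the first 25 wines printed earlier. The number of bins is an arbitrary choice, and the author's own plotting code is not shown in this report. Note that log10 is only defined for positive values, so a variable such as citric.acid, whose minimum is 0, would need separate handling.

```r
# Residual sugar (g / dm^3) for the first 25 wines printed in the output above.
residual_sugar <- c(20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70, 1.60,
                    1.50, 1.45, 4.20, 1.20, 1.50, 19.25, 1.50, 1.10, 1.20,
                    1.10, 7.50, 1.20, 2.90, 1.70, 1.50, 1.30)

# log10() needs strictly positive input, so check before transforming.
stopifnot(all(residual_sugar > 0))
log_sugar <- log10(residual_sugar)

# Bin the transformed values. plot = FALSE returns the bins without drawing;
# drop it to draw the histogram on the active graphics device.
h <- hist(log_sugar, breaks = 8, plot = FALSE)

# Show the left edge of each bin and the number of wines in it.
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```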

Bivariate Plots Section

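As mentioned in the previous section, there might be some relationships between variables. To examine the correlations, we create a correlation matrix. The table below gives a common rule of thumb for interpreting the strength of a correlation coefficient.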

##        Positive  Negative    
## Small  .1 to .3  -0.1 to -0.3
## Medium .3 to .5  -0.3 to -0.5
## Large  .5 to 1.0 -0.5 to -1.0
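
To make the rule of thumb above concrete, the sketch below computes a Pearson correlation matrix with cor() and labels a coefficient by its absolute value. It uses five columns from the 25 rows printed earlier, so its coefficients will not match those of the full dataset. The helper strength() and the label "Negligible" for values below 0.1 are assumptions introduced here for illustration; the table does not name that range.

```r
# Five columns for the first 25 wines printed in the output above.
wine <- data.frame(
  residual.sugar = c(20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00, 20.70, 1.60,
                     1.50, 1.45, 4.20, 1.20, 1.50, 19.25, 1.50, 1.10, 1.20,
                     1.10, 7.50, 1.20, 2.90, 1.70, 1.50, 1.30),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951, 0.9949, 1.0010,
              0.9940, 0.9938, 0.9908, 0.9947, 0.9920, 0.9912, 1.0002, 0.9914,
              0.9928, 0.9892, 0.9917, 0.9955, 0.9892, 0.9912, 0.9930, 0.9937,
              0.9951),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22, 2.99,
         3.14, 3.18, 3.54, 2.98, 3.25, 3.24, 3.33, 3.12, 3.22, 3.33, 3.17,
         3.47, 3.05, 3.42),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0, 12.0, 9.7,
              10.8, 12.4, 9.7, 11.4, 9.6, 12.8, 11.3, 9.5, 12.8, 11.0, 10.5,
              9.3, 10.0),
  quality = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5,
              7, 5, 7, 6, 8, 6, 5, 8, 7, 8, 5, 6)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine)
print(round(cor_matrix, 2))

# Hypothetical helper: label a coefficient using the thresholds in the table.
# Bins are [0, 0.1), [0.1, 0.3), [0.3, 0.5) and [0.5, 1] on the absolute value.
strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.1, 0.3, 0.5, 1),
      labels = c("Negligible", "Small", "Medium", "Large"),
      right = FALSE, include.lowest = TRUE)
}

# Label the sample correlation between residual sugar and density.
print(strength(cor_matrix["residual.sugar", "density"]))
```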

Now, we analyze quality against several variables.

Let’s take a look at the relationships between the following variables:
- alcohol & pH,
- alcohol & density,
- alcohol & residual sugar.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • Surprisingly, there are not many chemical characteristics that can predict the quality of wine, except for the percentage of alcohol.
  • Most of the variables have a negative correlation with quality.
  • It appears that a good wine contains a higher volume of alcohol and has a higher pH, while its residual sugar concentration and density are low.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • We can notice a strong correlation between residual sugar & density, free sulfur dioxide & total sulfur dioxide, density & total sulfur dioxide, and alcohol & quality.

What was the strongest relationship you found?

  • The relationship between alcohol percentage and quality is clear: alcohol is the variable most strongly correlated with quality.
  • Overall, however, the strongest relationship is between residual sugar and density, with a correlation coefficient of 0.84.

Multivariate Plots Section

Now, let’s take a look at our linear models.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + sulphates, data = wine)
## m3: lm(formula = quality ~ alcohol + sulphates + pH, data = wine)
## m4: lm(formula = quality ~ alcohol + sulphates + pH + density, data = wine)
## m5: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide, 
##     data = wine)
## m6: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide + 
##     total.sulfur.dioxide, data = wine)
## m7: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide + 
##     total.sulfur.dioxide + chlorides, data = wine)
## m8: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide + 
##     total.sulfur.dioxide + chlorides + residual.sugar, data = wine)
## m9: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide + 
##     total.sulfur.dioxide + chlorides + residual.sugar + citric.acid, 
##     data = wine)
## m10: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide + 
##     total.sulfur.dioxide + chlorides + residual.sugar + citric.acid + 
##     volatile.acidity, data = wine)
## m11: lm(formula = quality ~ alcohol + sulphates + pH + density + free.sulfur.dioxide + 
##     total.sulfur.dioxide + chlorides + residual.sugar + citric.acid + 
##     volatile.acidity + fixed.acidity, data = wine)
## 
## ==================================================================================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11       
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               2.582***      2.341***      1.683***    -20.991***    -12.729*      -23.194***    -21.332***    111.361***    122.034***    109.087***    150.193***  
##                            (0.098)       (0.110)       (0.250)       (6.181)       (6.207)       (6.413)       (6.419)      (13.579)      (13.863)      (13.507)      (18.804)    
##   alcohol                   0.313***      0.314***      0.311***      0.353***      0.358***      0.352***      0.336***      0.202***      0.188***      0.239***      0.193***  
##                            (0.009)       (0.009)       (0.009)       (0.015)       (0.015)       (0.015)       (0.015)       (0.019)       (0.020)       (0.019)       (0.024)    
##   sulphates                               0.476***      0.429***      0.392***      0.360***      0.425***      0.432***      0.670***      0.659***      0.576***      0.631***  
##                                          (0.100)       (0.101)       (0.101)       (0.101)       (0.101)       (0.101)       (0.102)       (0.102)       (0.099)       (0.100)    
##   pH                                                    0.225**       0.229**       0.213**       0.233**       0.215**       0.488***      0.552***      0.466***      0.686***  
##                                                        (0.077)       (0.077)       (0.076)       (0.076)       (0.076)       (0.079)       (0.081)       (0.079)       (0.105)    
##   density                                                            22.367***     13.858*       24.578***     23.015***   -110.494***   -121.415***   -108.112***   -150.284***  
##                                                                      (6.093)       (6.125)       (6.345)       (6.347)      (13.611)      (13.909)      (13.552)      (19.075)    
##   free.sulfur.dioxide                                                               0.006***      0.009***      0.009***      0.007***      0.006***      0.004***      0.004***  
##                                                                                    (0.001)       (0.001)       (0.001)       (0.001)       (0.001)       (0.001)       (0.001)    
##   total.sulfur.dioxide                                                                           -0.002***     -0.002***     -0.002***     -0.002***     -0.000        -0.000     
##                                                                                                  (0.000)       (0.000)       (0.000)       (0.000)       (0.000)       (0.000)    
##   chlorides                                                                                                    -2.235***     -1.456**      -1.594**      -0.540        -0.247     
##                                                                                                                (0.552)       (0.550)       (0.550)       (0.539)       (0.547)    
##   residual.sugar                                                                                                              0.062***      0.066***      0.066***      0.081***  
##                                                                                                                              (0.006)       (0.006)       (0.006)       (0.008)    
##   citric.acid                                                                                                                               0.356***      0.059         0.022     
##                                                                                                                                            (0.096)       (0.095)       (0.096)    
##   volatile.acidity                                                                                                                                       -1.896***     -1.863***  
##                                                                                                                                                          (0.113)       (0.114)    
##   fixed.acidity                                                                                                                                                         0.066**   
##                                                                                                                                                                        (0.021)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.193         0.195         0.197         0.209         0.215         0.218         0.237         0.239         0.280         0.282     
##   adj. R-squared            0.190         0.193         0.194         0.196         0.209         0.214         0.217         0.236         0.238         0.279         0.280     
##   sigma                     0.797         0.796         0.795         0.794         0.788         0.785         0.784         0.774         0.773         0.752         0.751     
##   F                      1146.395       587.145       394.902       300.301       259.103       223.866       194.831       189.967       170.827       190.448       174.344     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5828.015     -5823.718     -5816.981     -5779.268     -5760.359     -5752.163     -5691.736     -5684.858     -5548.674     -5543.740     
##   Deviance               3112.257      3097.833      3092.402      3083.907      3036.780      3013.424      3003.356      2930.157      2921.939      2763.891      2758.329     
##   AIC                   11684.782     11664.029     11657.436     11645.962     11572.535     11536.719     11522.326     11403.472     11391.715     11121.347     11113.480     
##   BIC                   11704.272     11690.016     11689.918     11684.942     11618.011     11588.691     11580.795     11468.438     11463.177     11199.306     11197.936     
##   N                      4898          4898          4898          4898          4898          4898          4898          4898          4898          4898          4898         
## ==================================================================================================================================================================================
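
The table above compares nested models that each add one predictor to the previous one. A minimal sketch of that pattern for the first three models, using lm() and update() on the 25 rows printed earlier, is given below. Because it uses only this small sample, its coefficients and R-squared values will not match the table. The use of update() is a choice made here for brevity and is not necessarily how the author built the models.

```r
# Four columns for the first 25 wines printed in the output above.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0, 12.0, 9.7,
              10.8, 12.4, 9.7, 11.4, 9.6, 12.8, 11.3, 9.5, 12.8, 11.0, 10.5,
              9.3, 10.0),
  sulphates = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47, 0.45, 0.49, 0.45,
                0.56, 0.53, 0.63, 0.52, 0.67, 0.55, 0.36, 0.39, 0.53, 0.50,
                0.39, 0.35, 0.48, 0.51, 0.47),
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22, 2.99,
         3.14, 3.18, 3.54, 2.98, 3.25, 3.24, 3.33, 3.12, 3.22, 3.33, 3.17,
         3.47, 3.05, 3.42),
  quality = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5,
              7, 5, 7, 6, 8, 6, 5, 8, 7, 8, 5, 6)
)

# Start with a single predictor, then add one term at a time.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + sulphates)
m3 <- update(m2, . ~ . + pH)

# Collect R-squared and adjusted R-squared for each model.
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared)
)
print(round(fit, 3))
```

R-squared can only stay the same or increase as predictors are added to a nested model, so the adjusted R-squared, which penalizes extra terms, is the more useful column when comparing models of different sizes.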

Multivariate Analysis


Final Plots and Summary

Plot One | Quality vs. Alcohol

Description One

  • Most of the wines are categorized with a medium quality.
  • This shows that wines with higher quality tend to have a higher volume of alcohol.
  • Alcohol is the variable that has the strongest correlation with quality.

Plot Two | Quality vs. Alcohol & Quality vs. Density

Description Two

  • This plot shows the negative correlation between alcohol and density, and how each relates to quality.
  • Based on the outliers and the overlapping intervals, neither density nor alcohol alone would be enough to predict wine quality.

Plot Three | Residual Sugar vs. Density vs. Quality

Description Three

  • This plot summarizes the fact that good wines contain less sugar and are less dense.

Reflection
